Speech Recognition Using Librosa

Introduction to Librosa

Speech recognition has become a pivotal technology in various applications, from virtual assistants to transcription services. One popular Python library that aids in processing and analyzing audio signals for speech recognition is Librosa. In this post, we use Librosa to load and resample an audio file, transcribe it with a pre-trained Wav2Vec 2.0 model, and then score the sentiment of the transcript with VADER.

Importing Libraries

The code in this post relies on the following libraries and modules:

  • torch: A deep learning framework for building and training neural networks.

  • transformers.Wav2Vec2ForCTC: A pre-trained model for automatic speech recognition using Connectionist Temporal Classification.

  • transformers.Wav2Vec2Tokenizer: A tokenizer to preprocess audio inputs for the Wav2Vec2 model.

  • transformers.pipeline: A simple interface for running tasks like speech-to-text, translation, and more using pre-trained models.

  • librosa: A Python library for audio analysis and feature extraction.

  • vaderSentiment.SentimentIntensityAnalyzer: A tool for sentiment analysis, particularly effective for short texts like sentences.

  • IPython.display.Audio: A notebook widget for playing audio files inline.

Reading Audio File

The voice data features diverse sentences focusing on sensory experiences and cultural food references, making it ideal for testing speech recognition and contextual comprehension.

from IPython.display import Audio
Audio('harvard.mp3')  # Renders an inline audio player in a notebook

Implementation Steps

1. Load the Wav2Vec 2.0 Model and Tokenizer

  • The Wav2Vec2Tokenizer and Wav2Vec2ForCTC are loaded from the pre-trained Wav2Vec 2.0 checkpoint published by Facebook, which transcribes spoken words into written text.

  • The Wav2Vec2Tokenizer is responsible for converting raw audio input into a format that the Wav2Vec 2.0 model can understand, and for decoding the model's predictions back into text.

  • The Wav2Vec2ForCTC class represents the Wav2Vec 2.0 model itself, with an output layer trained using CTC (Connectionist Temporal Classification).

from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

# Load the pre-trained Wav2Vec 2.0 tokenizer and model
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")

2. Load and Preprocess the Audio

  • The audio file harvard.wav is loaded using librosa.load, which returns two values:

      y: The audio data (samples) as a NumPy array.

      sr: The sample rate. Passing sr=16000 makes Librosa resample the audio to 16,000 Hz, the rate that Wav2Vec 2.0 expects. A higher sample rate captures more detail, but the model should be given audio at the rate it was trained on.

  • The raw audio data (y) is passed to the tokenizer, which converts it into the input format the model expects. The return_tensors="pt" option specifies that the output should be a PyTorch tensor.

# Load and preprocess the audio
import librosa

audio_file = "harvard.wav"
y, sr = librosa.load(audio_file, sr=16000)  # Resample to 16 kHz, the rate Wav2Vec 2.0 expects
input_values = tokenizer(y, return_tensors="pt", padding="longest").input_values
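
To make the sample rate more concrete: at 16,000 Hz, one second of audio is 16,000 samples, and the model's convolutional feature encoder compresses those samples into a much shorter sequence of frames. The sketch below estimates that frame count. The kernel sizes and strides are an assumption taken from the published Wav2Vec 2.0 configuration, and num_output_frames is a helper name made up for this illustration, not part of any library.

```python
# Assumed feature-encoder layout: kernel sizes and strides of the seven
# 1-D convolutions in the published Wav2Vec 2.0 configuration.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Estimate how many frames the encoder emits for num_samples of audio."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output-length formula for an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


sample_rate = 16_000
print(num_output_frames(sample_rate))  # 49 frames, roughly one every 20 ms
```

Under these assumptions, each frame of logits in the next step covers about 20 ms of speech, which is why the audio must be at the sample rate the model was trained on.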

3. Perform Speech Recognition

  • torch.no_grad(): This context manager disables gradient calculations, saving memory and computation, since we are only interested in inference (i.e., generating predictions, not training).

  • The model outputs logits, which are raw, unnormalized scores for each token at each time step. torch.argmax is used to pick the most likely token at each step. The predicted token IDs are then decoded back into human-readable text using the tokenizer.

import torch

# Perform inference without tracking gradients
with torch.no_grad():
    logits = model(input_values).logits

# Decode the output
predicted_ids = torch.argmax(logits, dim=-1)
transcription = tokenizer.decode(predicted_ids[0])

print("Transcribed text:", transcription)
Transcribed text: THE STALE SMELL OF OLD BEER LINGERS IT TAKES HEAT TO BRING OUT THE ODOUR A COLD DIP RESTORES HEALTH AND ZEST A SALT PICKLE TASTES FINE WITH HAM TAKO'S AL PASTOR ARE MY FAVORITE A ZESTFUL FOOD IS THE HOT CROSS BUN
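
To illustrate what happens between the argmax and the final text, here is a minimal sketch of greedy CTC decoding in plain Python: repeated predictions in consecutive frames are collapsed, blank tokens are dropped, and the word delimiter is turned into a space. The toy vocabulary, the blank ID of 0, and the "|" delimiter are assumptions for this example; the real tokenizer defines its own mapping and handles this internally in tokenizer.decode.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, and join the rest."""
    chars = []
    previous = None
    for token_id in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_token[token_id])
        previous = token_id
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary, used only for this illustration.
vocab = {0: "<pad>", 1: "|", 2: "H", 3: "A", 4: "M"}
frames = [2, 2, 0, 3, 3, 3, 0, 0, 4, 1, 1, 0, 3, 0, 4]
print(ctc_greedy_decode(frames, vocab))  # HAM AM
```

The blank token is what lets CTC represent doubled letters: a blank between two identical predictions keeps them from being merged into one.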

4. Sentiment Analysis with VADER

  • Once the text is transcribed, we use VADER (Valence Aware Dictionary and sEntiment Reasoner) to perform sentiment analysis.

  • This tool returns positive, neutral, and negative scores, which are the proportions of the text falling into each category and sum to 1, along with a compound score between -1 and 1 representing the overall sentiment.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Initialize VADER sentiment analyzer
vader_analyzer = SentimentIntensityAnalyzer()

# Perform VADER sentiment analysis
vader_scores = vader_analyzer.polarity_scores(transcription)

# Output the VADER sentiment analysis results
print("VADER Sentiment Analysis Result:")
print("Positive:", vader_scores['pos'])
print("Neutral:", vader_scores['neu'])
print("Negative:", vader_scores['neg'])
print("Compound:", vader_scores['compound'])
VADER Sentiment Analysis Result:
Positive: 0.149
Neutral: 0.851
Negative: 0.0
Compound: 0.7184
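
To help read the compound score, the sketch below reproduces two conventions from the VADER documentation in plain Python: the normalization that squashes a summed valence into the range -1 to 1 (with alpha = 15), and the commonly used cutoffs of 0.05 and -0.05 for labeling text as positive or negative. The function names and the raw sum of 4.0 are hypothetical, chosen only for illustration.

```python
import math


def normalize_compound(raw_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the range [-1, 1]."""
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    return max(-1.0, min(1.0, score))


def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# A hypothetical raw valence sum, used only to show the normalization.
print(round(normalize_compound(4.0), 4))  # 0.7184
print(label_from_compound(0.7184))        # positive
```

By these cutoffs, the compound score of 0.7184 reported above counts as clearly positive, even though most of the individual words are neutral.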

Conclusion

Integrating Wav2Vec 2.0 for speech-to-text and VADER for sentiment analysis bridges the gap between audio data and insights. Speech recognition transcribes spoken language, while sentiment analysis uncovers emotional tones. These techniques enhance NLP and audio analysis, enabling smarter, real-world applications.